-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: don't necessarily include all artifacts from templates in node outputs #13066
Conversation
…utputs Signed-off-by: Julie Vogelman <[email protected]>
} | ||
we.Template.Outputs.Artifacts[i] = art |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of updating this directly, create a new list of the artifacts that we actually saved and return that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. So before we were using the we.Template.Outputs.Artifacts
to pass to ReportOutputs
but now we are creating a fresh wfv1.Artifacts
object that will only include the successfully saved artifacts. This makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I do see a small issue here, but I think this already existed beforehand:
If the wait
container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported
But since adding artifacts to Outputs.Artifacts
wouldn't report them until the very end anyway, I think this doesn't change any existing behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the wait container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported
How do you mean it wouldn't be reported? Since the background context is passed into both SaveArtifacts() and ReportOutputs() below, I believe all of that logic is still supposed to be executed. Is there something different you're referring to besides the context being cancelled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the loop is long or the container is otherwise non-gracefully terminated, the request context won't matter.
Also realized that saving artifacts can be parallelized here to reduce that chance and improve throughput (related: #12442). Similarly pre-existing logic though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is basically if this Pod receives a SIGKILL? I don't think there's anything that can be done in that case, is there? (at least from the Workflow Pod side)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More or less yes, but there are things we can do to mitigate that or detect that scenario, such as parallel uploads and writing pieces of the report as each artifact completes uploading (although k8s does not have streaming requests afaik, so that would put more strain on the api server when there are many artifacts).
I also recently remembered we had #12413 merged for 3.6 that does help prevent that too.
(To be clear, this is pre-existing logic / an incidental finding, so not going to block review on that, but there are perhaps actions we should take around this case. Perhaps better discussed in #12993 though; although that's a specific case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
Signed-off-by: Julie Vogelman <[email protected]>
Signed-off-by: Julie Vogelman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
} | ||
we.Template.Outputs.Artifacts[i] = art |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. So before we were using the we.Template.Outputs.Artifacts
to pass to ReportOutputs
but now we are creating a fresh wfv1.Artifacts
object that will only include the successfully saved artifacts. This makes sense.
…, including when no artifact is written Signed-off-by: Julie Vogelman <[email protected]>
c14ffb5
to
c650be0
Compare
I was thinking about this and it will probably be fixed by parallel artifact GC (#11768), as that inherently has to process each artifact independently. Vaguely related to this, I was wondering if creating more |
Signed-off-by: Julie Vogelman <[email protected]>
Signed-off-by: Julie Vogelman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM. I left a few style comments on the tests, one suggestion for the test case, and a question on the saving + reporting. Still learning my way around the artifacts part of the codebase
} | ||
we.Template.Outputs.Artifacts[i] = art |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I do see a small issue here, but I think this already existed beforehand:
If the wait
container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported
But since adding artifacts to Outputs.Artifacts
wouldn't report them until the very end anyway, I think this doesn't change any existing behavior
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Julie Vogelman <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Julie Vogelman <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Julie Vogelman <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Julie Vogelman <[email protected]>
Signed-off-by: Julie Vogelman <[email protected]>
7fe0f16
to
9e5b1c5
Compare
Signed-off-by: Julie Vogelman <[email protected]>
Signed-off-by: Julie Vogelman <[email protected]>
Thanks for fixing the lint issue @agilgur5 ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some very tiny comments on the tests below, otherwise LGTM. Marking as approved so you can merge once you fix or answer those
fmt.Printf("artifact key: %q\n", artifactKey) | ||
if err != nil { | ||
panic(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be below the continue
conditional? Since it's not used by it, so can be skipped in that case. Unless you want to print each key?
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Julie Vogelman <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Julie Vogelman <[email protected]>
…utputs (argoproj#13066) Signed-off-by: Julie Vogelman <[email protected]> Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Yulin Li <[email protected]>
…utputs (#13066) Signed-off-by: Julie Vogelman <[email protected]> Co-authored-by: Anton Gilgur <[email protected]> (cherry picked from commit 2ca4841)
The fact that it's being listed in the NodeStatus is causing ArtifactGC to attempt to delete it (and fail doing so).
Fixes: #12845
Motivation
If an Artifact file didn't get written, it shouldn't be in the outputs for the TaskResult (and thus the Outputs for the Node) just because it was listed as an Artifact in the Workflow.
Modifications
Only if the Artifact file is successfully written does the wait container include it as one of the Output Artifacts in its
WorkflowTaskResult
. The Workflow Controller uses thatWorkflowTaskResult
information to populate theNodeStatus
of the Workflow. This is the information that ArtifactGC uses to determine what needs to be deleted.Verification
The person who submitted the Issue tested the image I made, and it repeatedly worked to solve his issue of ArtifactGC failing. (Note it also prevented his other artifacts from getting deleted, because the failed deletion preempted the other deletions from being attempted (as @agilgur5 pointed out, this could be another improvement, to continue the deletion process independent of failure)).
I did start to create an e2e test for this issue. But it seems that with minio, I can't repeat the same failure using the "main" branch. It seems that attempting to delete an artifact which doesn't exist doesn't result in a failure for some reason.